================================== Data Extraction and Classification ================================== ----------------------- Lexicons and Embeddings ----------------------- Lexicons ======== **Lexicons** hold lists of words used for searching and can also be used to find similar words with the help of :ref:`Embeddings`. To access lexicons, click on the **Lexicons** button in the menu. Besides creating a new lexicon, you have the option to merge lexicons. .. image:: /_static/lexicons1.png :scale: 80% 1. To create a lexicon, click on the **Create** button and enter a description, then click on **Create**. .. image:: /_static/lexicons2.png :scale: 80% 2. Now the lexicon is created, but doesn’t have any words. You can delete the lexicon or edit its name with the **Delete/Edit** buttons. .. image:: /_static/lexicons3.png :scale: 60% 3. Click on the lexicon to see its contents. You can see four large boxes. The most important of these is the upper right box called **Positives**, you can ignore the others for now. .. image:: /_static/lexicons4.png :scale: 40% 4. In the upper left you have the option to select an **Embedding**, on the right you have the buttons **Clear Lexicon** and **Save**. .. image:: /_static/lexicons5.png :scale: 60% 5. Enter some keywords in the **Positives** box, then press **Save**. .. image:: /_static/lexicons6.png :scale: 80% 6. If we go back to the **Lexicons** page, we can see we have five words in our lexicon now. |br| ------ Using Lexicons in a Search -------------------------- You can use a lexicon to conduct a search. 1. Create a lexicon. 2. After you've created a lexicon, go back to the **Search** view and start an :ref:`Advanced search`. .. image:: /_static/lexicons7.png :scale: 60% 3. After choosing the text field for the search, we have the option to **Add lexicon**. .. image:: /_static/lexicons8.png :scale: 60% 4. Choose the lexicon and click on it. Hovering over it with the cursor will also show you some of the words in the lexicon. .. image:: /_static/lex-long.png :scale: 60% 5. Press **Search** to search for these words. You can add more words to the search query or change the search constraints without changing the lexicon. You can also change the :ref:`Operators`. |br| ------ Embeddings ========== **Embeddings** create a vector representation of a dataset. Embeddings are used for suggesting similar words (for example, to add to your lexicon) and are also used for training some models. 1. To train an embedding go to the **Models** menu and select **Embeddings**, then press **Create** to create an embedding. .. image:: /_static/embedding1.png :scale: 60% 2. Fill in a description, select indices and fields used to build the model. To get keywords, you should use lemmatized data. 3. You can also use a query here. .. image:: /_static/embedding-type.png :scale: 80% 4. Select the **Embedding Type**, *W2VEmbedding* or *FastTextEmbedding*: Word2vec creates word embeddings based on the entire word, while fastText generates embeddings based on the subword units of a word. 5. Choose the **Minimum Frequency**, the minimum amount a word or subword has to appear; **Number of Dimensions**, more dimensions leads to better results but slower creation; and **Context Window Size**, the number of words used to determine the context of each word. .. image:: /_static/embedding-epoch.png :scale: 80% 6. Select the number of **Epochs**, the number of times the corpus will be looked at. 7. You can also use the phraser, which will chunk up related words into phrases. 8. Press **Create** to create the embedding. This will take a little bit of time. You can refresh the page to see the progress. |br| ------ Using embeddings for generating new words into a lexicon -------------------------------------------------------- 1. After training an embedding, you can go back to your test lexicon. .. image:: /_static/embedding-lexicon.png :scale: 80% 2. Select the test embedding. .. image:: /_static/embedding-suggestions1.png :scale: 80% 3. Now you can press on the **New Suggestions** button in the upper right. .. image:: /_static/embedding-suggestions2.png :scale: 60% 4. The new suggestions will appear on the far right. If you click on a suggestion, it will automatically be added to **Positives** and used to find more keywords. 5. You can use the arrows in the middle to exchange the **Positives** with **Positives Unused** (will not be used to give new suggestions but will remain in the lexicon) and **Negatives** with **Negatives Unused**. The **Negatives** will also affect the new suggestions. .. image:: /_static/lexicon-move1.png :scale: 40% .. image:: /_static/lexicon-move2.png :scale: 40% 6. After selecting all the new suggestion words into **Positives**, you can click on the **New Suggestions** button again to receive new suggestions. The words that weren’t selected automatically move into the **Negatives Unused** section. 7. After you have received all the suggestions you need, click on **Save** in the upper right to save the lexicon. |br| ------ -------------------------------- Metadata Extraction and Labeling -------------------------------- Let’s say you’ve identified something similar about a group of documents, maybe they all contain a certain word or a phrase/list of words in a certain field. By finding all the documents that contain these similarities, you can generate metadata facts. If there are certain set words or phrases associated with the metadata, then it is easy to find using :ref:`Search Query Taggers`. In some cases, the keywords are more complicated and require different variants or could have misspellings. In these cases a :ref:`regular expression ` makes more sense. Search Query Taggers ==================== **Search Query Taggers** use search queries as the base for extracting data from a given field. 1. Make sure you have a query ready. 2. To access Search Query Taggers, click on **Models** and select **Search Query Taggers**. Select **Create** from the menu. .. image:: /_static/search-query-tagger1.png :scale: 70% 3. Insert the task description, choose the indices and fields. Consider that the index you choose will obtain a new fact during this process. 4. Select the saved query for this metadata, for example a query for the name of a university. In this example case, there is no query and the test metadata will be associated with all documents. 5. Select the **Fact name**, this should be a general category, like Genre, University, Discipline etc. 6. Select the **Fact value**. Examples: genre type like *MA Thesis*, university name like *University of Latvia*, discipline type like *Social science* etc. 7. Then click **Create** to create the task. .. image:: /_static/search-query-tagger2.png :scale: 40% 8. The result of adding the test fact. |br| ------ Regex Taggers ============= **Regex Taggers** are a way to create more complex search queries with regular expressions. 1. To use Regex Taggers, go to **Models** and open **Regex Taggers**. .. image:: /_static/regex-taggers.png 2. Press **Create** to create a new regex tagger or **Multitag Text** for testing already created regex taggers. Creating a regex tagger creates the regular expression pattern for finding matching documents and you can use the pattern for testing or apply to whole documents and indices. .. image:: /_static/regex-taggers-multitag.png :scale: 70% *An example of the multitag menu* .. image:: /_static/regex-tagger-train.png :scale: 60% 1. To train a new regex tagger, write in regex tagger description. 2. In the lexicon field, you can add a lexicon using the **Add lexicon** button or write in your search terms - you can use Elasticsearch regular expression syntax for these items. 3. Under advanced settings you can use a counter lexicon (words or phrases that automatically negate the match), choose the operator and the match type, as well as phrase slop (how many words can be between between a phrase), counter lexicon slop (how many words between a counter lexicon item and match item) and number of allowed edits (to account of misspellings). 4. You can also choose whether to return fuzzy matches, ignore case or to ignore punctuation. .. image:: /_static/regex-tagger-match.png :scale: 60% 5. Match type can be **prefix**, **exact** or **subword**. * Exact match matches the exact word * Prefix match allows for suffixes to be added to the word * Subword match can also find the word sequence from inside another word - for example searching for *ja* using a subword match will also result in matches like *kajakas*. 6. Press **Create** to create the regex tagger. .. image:: /_static/regex-taggers-actions.png 7. Now the regular expression pattern is created, you can use it in different ways: * For testing matches, use the **Multitag Text** button in the menu or go to the task’s **Actions** and select **Tag Text**. * You can also **Tag Random Doc, Apply to indices, Edit, Duplicate** or **Clone & Edit** and of course **Delete** the task. Some **Tag text** examples: .. image:: /_static/regex-tagger-kool.png :scale: 80% *The regex tagger found a match for "koolis".* .. image:: /_static/regex-tagger-klassifikatsioon.png :scale: 80% *No match found for klassifikatsioonide - the regular expression is to find word that start with "klassi", but the specifications after that rule out "klassifikatsioonide".* .. image:: /_static/regex-tagger-no-match1.png :scale: 80% *A sentence without any of the regex tagger lexicon present.* When the regex tagger is **applied to indices**, consider that it will add a new fact to the index. To apply the regex tagger to indices, select the field the regex pattern will try to match with and optionally, choose the *Fact name* and *Fact value* (by default the tagger match) of your new fact. Adding spans is recommended, since it will allow the Searcher to highlight created facts. .. image:: /_static/regex-tagger-apply.png :scale: 60% An example using the regex tagger trained before. .. image:: /_static/regex-tagger-res.png :scale: 40% The results. |br| ------ ------------------------------------ Keyword Extraction and Tagger Groups ------------------------------------ RaKUn Extractor and Search Fields Tagger ======================================== Let's say you have a special field with keywords in the corpus for each document and you'd like to turn the keywords into accessible metadata. The **Search Fields Tagger** takes keywords stored in a field and turns them into facts. If you'd like to create these keywords based on a :ref:`lexicon ` or by using an :ref:`embedding `, **RaKUn Extractor** extracts keywords to facts using either a lexicon or a FastText Embedding. The RaKUn Extractor can be trained based on lexicons. 1. Create a :ref:`lexicon ` or :ref:`FastTextEmbedding ` first. 2. Create a new RaKUn Extractor task with *editdistance* as algorithm and using your lexicon. If you have a FastText Embedding in the project, you have the option to use that as the algorithm. 3. When the Extractor is trained, open up the **Actions** menu. .. image:: /_static/rakun-actions.png :scale: 80% Example of using **Extract from Text**: .. image:: /_static/rakun2.png :scale: 80% The result: .. image:: /_static/rakun3.png :scale: 80% If applied to an index, RaKUn Extractor creates new facts based on the keywords it finds. You can also add in more words to the lexicon (by clicking on **Stop words** in the actions menu) or change other parameters for potential improvements in the keyword results. |br| ------ Tagger and Regex Tagger Groups ============================== **Tagger Group** creates several Taggers (read more about :ref:`Taggers` here) at once based on a dataset with existing facts. This allows you to use the facts for automatically tagging a new dataset. An example use would be to train an index with discipline in facts and then applying the Tagger Group on another index to detect disciplines based on statistical and lexical similarity. **Regex Tagger Groups** allows you to put together several :ref:`Regex Taggers` and apply them at once to multitag text. |br| ------ -------------- Classification -------------- CRF Extractors ============== If you have some data that has been annotated or some words that have already been extracted into facts using another tool, you could use CRF Extractors to apply this to a larger set of documents. **CRF** or *conditional random field* is a statistical model. 1. Training a CRF model requires an index that has facts you want to use for finding the same features in another index. The facts you choose must be associated with a text_mlp field (for example *text_mlp.lemmas* or *text_mlp.text*). Besides the facts, you also need an :ref:`embedding ` to find similar words. To train a CRF model, go to **Models** and choose **CRF Extractors**. .. image:: /_static/crf-train.png :scale: 60% 2. To train a CRF model, go to **Models** and choose **CRF Extractors**. For this example, we use the *Education_words* fact generated by the regex tagger and the test embedding. .. image:: /_static/crf-features.png :scale: 60% 3. Choose the **Feature fields** which specifies which fields are used to generate features for the model and the **Context feature fields** that specifies which fields are used for context. 4. By default we will be using all the fields for extra linguistic information, but you can specify them as needed. 5. We can also specify the **Feature extractors** and **Context Feature Extractors**. These include: * **isupper** - if word or context word occurs in uppercase or lowercase * **istitle** - if word or context word’s first letter is uppercase * **hasdigit** - word contains a number * **strlength** - takes the word length into account .. image:: /_static/crf-advanced.png :scale: 60% 6. There are also advanced settings, like setting the context window size, setting the test set size, changing the number of iterations, changing the class values or the suffix length. You can also change whether to use bias, which also considers the proportion of the word appearing in the training dataset. 7. Press **Create**. .. image:: /_static/crf-completed.png :scale: 60% 8. Now the model is complete. We can see that the trained model's F1, precision and recall has been evaluated based on the evaluation set size. .. image:: /_static/crf-actions.png :scale: 60% 9. In the **Actions** menu you can **apply the trained extractor to indices, Tag text, List Features, Clone & Edit, Edit (Rename), Retrain** or **Delete**. .. image:: /_static/crf-found-features.png :scale: 60% 10. Using **List Features**, we can view the features associated with the keywords we chose before. For example, the most important feature for our educational words search is to find the lemma *kool*. Another important lemma is *õppima*. .. image:: /_static/crf-apply.png :scale: 60% 11. We apply CRF to an index that doesn't have the *Education fact*. Choose the **MLP fields** (usually **text_mlp**) and the **Label suffix** (will be added to the fact name to differentiate from original fact). 12. You can also add a search query here if you only want the new fact to be created on some documents. 13. Press **Apply**. .. image:: /_static/crf-results.png *Found words in the new index.* |br| ------ BERT Taggers ============ If we have a larger amount of documents with facts, we can train a BERT model to add the same facts to other indices based on the fact's occurrence. BERT taggers are usually trained to detect binary problems (problems with only two possible answers), whether something is true or not. This means that BERT can be used for sentiment analysis or hate speech detection, for example. BERT tagger uses a pre-trained large language model (which you can choose while training) so training the tagger is fine-tuning the selected model for a specific task. Read more about `BERT `_ here. 1. To create a BERT tagger, you should have an index with the facts you want to use for finding the same categories in another index. 2. To train a new BERT tagger, go to **Models** and click on **BERT Taggers**. .. image:: /_static/bert-train-sent.png :scale: 60% 3. Choose the model description, the training indices and fields as well as the fact name to train on. 4. Add a query if you wish. 5. This task can also be used for further fine-tuning another BERT model. .. image:: /_static/add-bert.png 6. Next choose an existing large language model (in the example above, the Estonian and Finnish BERT). You can add pretrained models by clicking on **Add pretrain model** in the BERT Taggers menu and entering the huggingface.co model name (for example, the Finnish and Estonian BERT is at https://huggingface.co/EMBEDDIA/finest-bert). .. image:: /_static/bert-train-sent2.png :scale: 60% 7. Select the minimum number of negative examples and the maximum number of positive examples, the negative multiplier and the maximum sequence length. 8. There is also an option to use GPU to make the training much faster, but since there is no GPU added to the Bwrite toolkit, this option is not available at the moment. 9. Press **Create** and wait for the model to train. 10. For a simple example tagging task, 50 documents were annotated with a sentiment fact that was either negative, positive or neutral as fact values. .. image:: /_static/bert-f1.png The F1, precision and recall scores of the BERT trained on the sentiment. .. image:: /_static/bert-labels.png The number of examples was very small. 11. After training the BERT tagger, you have many options: * For data science and model improvement purposes you can look at **Epoch reports** and **Confusion matrix**. * You can **apply** the trained model to indices, **tag text** or **tag a random document**. * You can also **edit** (rename), **retrain**, **clone & edit** or **delete** the tagger task. .. image:: /_static/bert-loss.png :scale: 60% An example of epoch reports. This shows the *accuracy*, *training loss* and *validation loss* during training. These are metrics to evaluate machine learning models. .. image:: /_static/bert-on-text-tag2.png :scale: 80% **Tagging text** example: simple. .. image:: /_static/bert-on-text-tag1.png :scale: 80% **Tagging text** example: multiclass output shows results for the other labels as well. .. image:: /_static/bert-on-text-tag3.png :scale: 80% **Tagging text** example: getting attributions: reasoning what tokens or words BERT found to give this result. .. note:: When applying or tagging with BERT tagger make sure that the fields don't have a mismatch. If you want to use BERT on regular (tokenized) text, then train it on regular text, not lemmas. The mismatch can affect results. .. image:: /_static/bert-confusion.png :scale: 40% An example of a confusion matrix. Usually (when training binary examples) a confusion matrix has only four boxes, showing Predicted label and True label intersection, which can be used to find what category the model works best on and what needs more or less examples. .. image:: /_static/bert-apply.png :scale: 60% Applying the model to an index. Consider that applying the model will create new facts to the selected index. The field selected should be as close to the trained BERT as possible (so if you train the model on lemmas, use should also be on lemmatized text). Add a unique fact name, but leave fact value empty, the model can generate the fact name. .. image:: /_static/bert-results.png In this example, the facts BERT learned were not very useful, since all it learned was that most documents are neutral. Given more training and test examples, it would be much more useful. |br| ------ Taggers ======= **Taggers** work very similarly to BERT, but tagger uses a vectorizer and classifier. Taggers are usually trained to detect binary problems (problems with only two possible answers), whether something is true or not. Taggers are most useful for training on keywords or for topic modeling. 1. To create a Tagger, you should have an index with a fact you want to use for finding the same category in another index. You should also have an :ref:`embedding ` to use. 2. Click on **Taggers** in the Models menu. Click **Create**. .. image:: /_static/tagger-train1.png :scale: 60% 3. Fill in the description, the indices used for training, the field to train on and the fact name present in the index or indices. 4. Additional options are to balance the different labels in the facts. 5. You can also use a query or stopwords/lexicon. .. image:: /_static/tagger-train2.png :scale: 60% 6. Select the embedding you wish to use. 7. You can also additionally use a snowball stemmer to normalize the texts, if you wish. If not and you want to use the stemmer, it would be good to use the **Detect language** option. 8. After this, select the scoring function, which is used while evaluating the results. The options are: * default * precision * recall * f1_score * accuracy * jaccard You can keep this at default if you don't know which to choose. 9. Select the analyzer: if the text should be analyzed as words or characters. 10. Select the vectorizer and classifier. Basically you are choosing the algorithms to use. More information about these choices here: https://scikit-learn.org/stable/modules/feature_extraction.html. 12. Define the maximum and minimum size. If you have any facts that occur in less documents than the minimum, they won't be used for training. The maximum is the amount of documents overall used for training (if you have any more, then only the maximum wil be used for training). 12. You can also use the negative multiplier if you wish. .. image:: /_static/tagger-f1.png F1 results for the created Tagger. .. image:: /_static/tagger-actions.png A list of actions for the Tagger task: we can **list features** (for example when training using the count vectorizer), **apply to indices**, **tag text**, **tag doc**, **tag random doc**, **display confusion matrix**, **edit** (rename task), add **stop words**, **retrain tagger**, **clone & edit** or **delete** task. .. image:: /_static/tagger-apply.png :scale: 60% Applying the Tagger on an index. .. image:: /_static/tagger-results.png Results are as expected, since an overwhelming amount of test data was neutral, Tagger has the same result. .. |br| raw:: html